-
Notifications
You must be signed in to change notification settings - Fork 1.7k
[DRAFT] Extension Type Registry Draft #18552
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
paleolimbot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
I think this is a great demonstration of the components we need: something that represents a "type" in the way that we actually mean, something to instantiate it from name/metadata/storage type, and some concrete problem that can be solved by combining those.
It might be a nice addition to this demonstration the code required to integrate this to Expr such that Display for an Expr::Literal or the CLI printer can make use of this (maybe an optimizer rule could hydrate any unresolved logical types?). I like the LogicalType as an endpoint but it is a valid concern that it might disrupt the existing logical Expr and such a demonstration might help alleviate that concern.
The ExprField idea I had was basically to start with struct ExprField { inner: FieldRef } (breaking change) and slowly migrate to struct ExprField { logical_type: LogicalType, nullability: bool, ... <other stuff?>} in a non-breaking way. It's very possible that going straight to the LogicalType is just as easy to do (I haven't tried it!).
| /// Represents the canonical [Opaque extension type](https://arrow.apache.org/docs/format/CanonicalExtensions.html#opaque). | ||
| /// | ||
| /// In the context of DataFusion, a common use case of the opaque type is when an extension type | ||
| /// is unknown to DataFusion. Contrary to [UnresolvedExtensionType], the extension type has | ||
| /// already been checked against the extension type registry and was not found. | ||
| impl LogicalType for Opaque { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure mixing the canonical extension "Opaque" and Opaque here is a good idea, but I do think this logical type is a good one to have...it is a place where we can define what happens to a field with ARROW:extension:name in the absence of any other guidance (e.g., here we would print values identically to the storage type...we could define equality here using byte-for-byte equality of metadata if we want to be strict, or using storage type equality if we want to be lenient).
| fn pretty_print_scalar(&self, value: &ScalarValue) -> Result<String> { | ||
| Ok(format!("datafusion.unresolved({value})")) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this error? Like, if this code is called, it meant a programming error led to the logical type not being resolved?
Or maybe this should be unified with the Opaque logical type for now (in other words, we'll always construct Opaque extensions which have acceptable default behaviour, and work towards ensuring they are resolved everywhere).
Which issue does this PR close?
This is a draft for #18223 . The APIs are not to be considered final (e.g., options are missing in the pretty printer).
The primary purpose is to spark discussion for now.
So happy to hear inputs!
Rationale for this change
How cool would it be to just state that you should properly format my byte-encoded uuids? :)
What changes are included in this PR?
LogicalTypetrait for some canonical extension types from arrow.UnresolvedExtensionType, a "DataFusion canonical extension type" that can be used to create aLogicalTypeinstance even without a registry. For example, thenew_...functions forDFSchemacould make use of this type, as they currently have no access to a registry. Furthermore, these function could directly instantiate the canonical arrow extension types as they are known to the system. Then the functions could resolve native and canonical extension types themselves without an access to the registry and then "delay" the resolving of the custom extension types. The idea is that there is then a "Type Resolver Pass" that has access to a registry and replaces all instances of this type with the actual one. While I hope that this is only a temporary solution until all places have access to a logical type registry, I think this has the potential to become a "permanent temporary solution". With this in mind, we could also consider making this explicit with an enum and not hide it behind dynamic dispatch.ValuePrettyPrinterfor showcasing the UUID pretty printing.ExtensionTypeRegistryinSessionStateWhat is also important is what is not included: an integrative example of making use of the pretty printer. I tried several avenues but, as you can imagine, each change to the core data structure is a huge plumbing effort (hopefully reduced by the existence of
UnresolvedLogicalType).I really like the suggestion by @paleolimbot to use pretty-printing record batches as the first use case. You can see a mini example in the test that pretty-prints UUIDs. The nice thing is that this probably would not require much plumbing as the [DataFrame] already has access to the [SessionState]. The only thing that's missing for me to actually include this example here is that
arrow-rsdoes not currently support passing custom pretty printers inpretty_format_batches_with_options.Imagine that the
to_stringfunction in theDataFramedoes the following:If you think this is a worthwhile pursuit we could add the capability to arrow-rs.
Are these changes tested?
Not really, as there is not integrative example yet.
Are there any user-facing changes?
There would be.